Developments in genomics are providing a biological basis for the heterogeneity of clinical
course and response to treatment that has long been apparent to clinicians. The ability
to molecularly characterize human diseases presents new opportunities to develop more
effective treatments and new challenges for the design and analysis of clinical trials. In 
oncology, treatment of broad populations with regimens that benefit a minority of patients 
is less economically sustainable with expensive molecularly targeted therapeutics. The 
established molecular heterogeneity of human diseases requires the development of new 
paradigms for the design and analysis of randomized clinical trials as a reliable basis for 
predictive medicine. We review prospective designs for the development of new thera- 
peutics and predictive biomarkers to inform their use. We cover designs for a wide range 
of settings. At one extreme is the development of a new drug with a single candidate bio- 
marker and strong biological evidence that marker negative patients are unlikely to benefit 
from the new drug. At the other extreme are phase III clinical trials involving both genome- 
wide discovery of a predictive classifier and internal validation of that classifier. We have 
outlined a prediction based approach to the analysis of randomized clinical trials that both 
preserves the type I error and provides a reliable internally validated basis for predicting 
which patients are most likely or unlikely to benefit from a new regimen.

INTRODUCTION

The dominant paradigm for oncology drug development has been
rapidly changing. The paradigm for development of cytotoxics 
involved large phase III clinical trials to find relatively small, but 
statistically significant, average treatment effects for target popu- 
lations defined in terms of primary site and stage. The primary 
analysis was relatively simple, consisting of a single statistical test 
of the null hypothesis of no average treatment effect for the intent 
to treat population with regard to a single primary endpoint. Any 
claim of treatment benefit based on subset analysis without an 
overall statistically significant intent to treat analysis was viewed 
with suspicion. 

Randomized clinical trials have made important contributions 
to modern medicine and public health, but they have also led 
to the over-treatment of broad populations of patients, most of 
whom don't benefit from the increasingly expensive drugs and pro- 
cedures shown to have statistically significant average treatment 
effects in increasingly large clinical trials. With the recognition 
of the molecular heterogeneity of cancer and the development 
of molecularly targeted drugs whose effects depend strongly on 
the genomic alterations and genetic background of the tumor, 
the broad eligibility, primary site oriented clinical trial is playing
a less dominant role. Increasingly sophisticated and cost effec- 
tive biotechnology platforms are providing the tools to develop 
diagnostics that identify the patients most likely to benefit from 
molecularly targeted drugs. 

Tumors of a primary site in many cases represent a heterogeneous
collection of diseases that differ in pathophysiology, natural his-
tory, and sensitivity to treatment. These diseases differ with regard
to the mutations that cause them and drive their invasion. The
heterogeneous nature of tumors of the same primary site offers 
new challenges for drug development and clinical trial design. 
Physicians have always known that cancers of the same primary site 
were heterogeneous with regard to natural history and response 
to treatment. This understanding sometimes led to conflicts with 
statisticians over the use of subset analysis in the analysis of clini- 
cal trials. Although most statisticians expressed concern about the 
potential for false positive findings from post hoc subset
analysis, some practitioners rejected the results of clinical trials 
whose conclusions were based on average effects. Today we have 
better tools for characterizing the tumors biologically and using 
this characterization in the design and analysis of clinical trials 
that utilize this information prospectively. 

Most oncology drugs are being developed for defined molecu- 
lar targets. In some cases the targets are well understood and there 
is a compelling biological basis for restricting development to the 
subset of patients whose tumors are characterized by deregulation 
of the drug target. For other drugs there are multiple targets and 
more uncertainty about how to measure whether a drug target is 
driving tumor invasion in an individual patient (1). It is clear that
the primary analysis of the new generation of oncology clinical 
trials must consist of more than just treating broad patient pop- 
ulations and testing the null hypothesis of no average effect. But 
it is also clear that the tradition of post hoc data dredging sub- 
set analysis is not an adequate basis for predictive oncology. For 
establishing practice standards and for drug approvals we need 
prospective analysis plans that provide for both preservation of 
the type I experiment-wise error rate and for focused predictive
analyses that can be used to reliably select patients in clinical prac-
tice for use of the new regimen (2-4). These two primary objectives
involve co-development of a drug and a companion diagnostic. 

In the following sections we summarize some of the designs 
that are available for the co-development of a drug and com-
panion diagnostic. Developing new treatments with companion
diagnostics or predictive biomarkers for identifying the patients 
who benefit does not make drug development simpler, quicker, or 
cheaper as is sometimes claimed. Actually it makes drug develop- 
ment more complex and probably more expensive. But for many 
new oncology drugs it should increase the chance of success. It may 
also lead to more consistency in results among trials and increase 
the proportion of patients who benefit from the drugs they receive. 
This approach also has great potential value for controlling societal 
expenditures on health care. 

The ideal approach to co-development of a drug and compan-
ion diagnostic involves three steps. (i) Identification of a predictive
biomarker based on understanding the mechanism of action of the drug
and the role of the drug target in the pathophysiology of the disease.
This biological understanding should be validated and refined by
pre-clinical studies and early phase clinical trials. The predictive
biomarkers for successful cancer drugs have generally involved a
single gene or protein rather than a multivariate classifier. Mul-
tivariate classifiers have found some use as prognostic indi-
cators that reflect a combination of the pace of the disease and
the effect of standard therapy. Multivariate classifiers have rarely
been used as predictive biomarkers for response to specific drugs
because their use often reflects an incomplete understanding of
the mechanism of action of the drug or the role of its molecular
target. (ii) Development of an analytically validated test for mea-
surement of that biomarker. Analytically validated means that the
test accurately measures what it is supposed to measure or, if there
is no gold-standard measurement, that the test is reproducible and
robust. (iii) Use of that test to design and analyze a new clinical trial
to evaluate the effectiveness of that drug and how the effectiveness
relates to the biomarker value.

In the enrichment and stratified designs described below, bio- 
marker discovery and determination of the threshold of positivity 
are performed prior to the phase III trial. Cancer biology is com-
plex, however, and it is not always possible to have everything
sorted out in this way before launching the phase III clinical tri- 
als. We will also discuss designs and prospective analysis plans that 
permit one to adaptively determine the best threshold of positivity 
for the biomarker and designs that incorporate multiple candidate 
biomarkers. 

TARGETED (ENRICHMENT) DESIGNS 

Designs in which eligibility is restricted to those patients consid- 
ered most likely to benefit from the experimental drug are called 
"targeted designs" or "enrichment designs." With an enrichment 
design, the analytically validated diagnostic test is used to restrict 
eligibility for a randomized clinical trial comparing a regimen 
containing a new drug to a control regimen. This approach has 
now been used for pivotal trials of many drugs whose molec- 
ular targets were well understood in the context of the disease. 
Several authors (5-9) studied the efficiency of this approach rela-
tive to the standard approach of randomizing all patients without
using the biomarker test at all. The efficiency of the enrichment
design depends on the prevalence of test positive patients and on 
the effectiveness of the new treatment in test negative patients. 
When fewer than half of the patients are test positive and the 
new treatment is relatively ineffective in test negative patients, the 
number of randomized patients required for an enrichment design 
is dramatically smaller than the number of randomized patients 
required for a standard design. For example, if the treatment is
completely ineffective in test negative patients, then the ratio of the
number of patients required for randomization in the standard
design relative to the number required for the enrichment design is
approximately $1/y^2$, where y denotes the proportion of patients
who are test positive. The treatment may have some effective-
ness for test negative patients either because the assay is imperfect
for measuring deregulation of the putative molecular target or
because the drug has off-target anti-tumor effects. Even if the new
treatment is half as effective in test negative patients as in test pos-
itive patients, however, the randomization ratio is approximately
$4/(y + 1)^2$. This equals about 2.56 when $y = 0.25$, i.e., 25% of the
patients are test positive, indicating that the enrichment design
reduces the number of required patients to randomize by a factor
of 2.56.
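
The two ratios quoted above can be reproduced with a short Python sketch. It rests on an assumption that the text does not spell out: the required number of randomized patients is taken to be inversely proportional to the square of the average treatment effect, and the average effect in an unselected population is taken to be the prevalence-weighted mean of the effects in test positive and test negative patients. The function name and the parameter `relative_effect_negative` are illustrative only.

```python
def randomization_ratio(y: float, relative_effect_negative: float) -> float:
    """Approximate ratio of patients to randomize: standard / enrichment design.

    y: proportion of patients who are test positive (0 < y <= 1).
    relative_effect_negative: treatment effect in test negative patients,
        expressed as a fraction of the effect in test positive patients.
    """
    if not 0.0 < y <= 1.0:
        raise ValueError("y must be in (0, 1]")
    # Average effect in the unselected population, relative to test positives.
    diluted_effect = y + (1.0 - y) * relative_effect_negative
    # Assumption: sample size scales as 1 / effect^2.
    return 1.0 / diluted_effect ** 2


print(randomization_ratio(0.25, 0.0))  # 16.0, i.e. 1/y^2
print(randomization_ratio(0.25, 0.5))  # 2.56, i.e. 4/(y + 1)^2
```

This counts randomized patients only; the number of patients who must be screened to find the test positive ones is a separate quantity and is not modeled here.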

The enrichment design was very effective for the development 
of trastuzumab even though the test was imperfect and has sub- 
sequently been improved. Simon and Maitournam (5, 6) also 
compared the enrichment design to the standard design with 
regard to the number of screened patients. We have made the 
methods of sample size planning for the design of enrichment 
trials available online at http://brb.nci.nih.gov. The web-based
programs are available for binary and survival/disease-free sur- 
vival endpoints. The planning takes into account the performance 
characteristics of the tests and specificity of the treatment effects. 
The programs provide comparisons to standard non-enrichment 
designs based on the number of randomized patients required and 
the number of patients needed for screening to obtain the required 
number of randomized patients. 

The enrichment design is appropriate for contexts where there 
is a strong biological basis for believing that test negative patients 
will not benefit from the new drug. In such cases, including test 
negative patients may raise ethical concerns and may confuse the 
interpretation of the clinical trial. As described in the section on 
"stratification designs," if test negative patients are to be included 
then one should ensure that a sufficient number of test positive 
patients are included to provide an adequately powered evaluation. 
Often this is not done and instead one sees a mixed population 
of patients in an inadequately sized trial leading to ambiguous 
conclusions. 

The enrichment design does not provide data on the effective- 
ness of the new treatment compared to control for test negative 
patients. Consequently, unless there is compelling biological or 
phase II data that the new drug is not effective in test negative 
patients, the enrichment design may not be adequate to support 
approval of the test. If the biological rationale or phase II data is 
strong, however, then the test can be approved for identifying a 
subset of patients for whom an effective drug exists, rather than 
for distinguishing patients who do and do not benefit from the 
new drug.

In oncology, sequencing of tumor DNA to test for point or
structural alterations in genes whose protein products are drug- 
gable is rapidly becoming part of the standard diagnostic workup 
at advanced cancer centers. Regulatory body approvals of drugs 
for populations defined by such tests will require that the tests be 
shown to have good analytical performance (10). 

BIOMARKER STRATIFIED DESIGN 

When a predictive classifier has been developed but there is not 
compelling biological or phase II data that test negative patients do 
not benefit from the new treatment, it is generally best to include 
both classifier positive and classifier negative patients in the phase III clin-
ical trials comparing the new treatment to the control regimen. In
this case it is essential that an analysis plan be pre-defined in the 
protocol for how the predictive classifier will be used in the analy- 
sis. The analysis plan will generally define the testing strategy for 
evaluating the new treatment in the test positive patients, the test 
negative patients, and overall. The testing strategy must preserve 
the overall type I error of the trial and the trial must be sized to 
provide adequate statistical power for these tests. It is not sufficient 
to just stratify, i.e., balance, the randomization with regard to the 
classifier without specifying a complete analysis plan. The main 
value of "stratifying" (i.e., balancing) the randomization is that it 
assures that only patients with adequate test results will enter the 
trial. Pre-stratification of the randomization is not necessary for 
the validity of inferences to be made about treatment effects within 
the test positive or test negative subsets. If an analytically validated 
test is not available at the start of the trial but will be available by 
the time of analysis, then it may be preferable not to pre-stratify 
the randomization process. 

The purpose of the pivotal trial is to evaluate the new treatment 
overall and in the subsets determined by the pre-specified classifier 
(generally biomarker plus cut-point for positivity). The purpose is 
not to modify or optimize the classifier unless an adaptive design is 
used. Several primary analysis plans have been described (10-12) 
and a web-based tool for sample size planning for some of these 
analysis plans is available at http://brb.nci.nih.gov. For example, if
one has moderate strength evidence that the treatment, if effec-
tive at all, is likely to be more effective in the test positive cases,
one might first compare treatment versus control in test positive 
patients using a threshold of significance of 5%. Only if the treat- 
ment versus control comparison is significant at the 5% level in 
test positive patients, will the new treatment be compared to the 
control among test negative patients, again using a threshold of 
statistical significance of 5%. This sequential approach controls 
the overall type I error at 5%. To have 90% power in the test 
positive patients for detecting a 50% reduction in hazard for the 
new treatment versus control at a two-sided 5% significance level 
requires about 88 events in test positive patients. If at the time
of analysis the event rates in the test positive and test negative
strata are about equal, then when there are 88 events in the test
positive patients, there will be about $88(1 - y)/y$ events in the test
negative patients, where y denotes the proportion of test positive
patients. If 25% of the patients are test positive, then there will be
approximately 264 events in test negative patients. This will pro-
vide approximately 90% power for detecting a 33% reduction in
hazard at a two-sided significance level of 5%. In this case, the trial
will not be delayed compared to the enrichment design, but a large
number of test negative patients will be randomized, treated, and
followed on the study rather than excluded as for the enrichment
design. This will be problematic if one does not, a priori, expect
the new treatment to be effective for test negative patients. In this 
case it will be important to establish an interim monitoring plan 
to terminate accrual of test negative patients when interim results 
and prior evidence of lack of effectiveness make it no longer viable
to enter them. 
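
The event counts quoted above are consistent with a standard large-sample approximation for the log-rank test under 1:1 randomization, in which the required number of events is 4(z_alpha + z_power)^2 / (log hazard ratio)^2. The text does not name the formula it used, so the following Python sketch should be read as an assumption that happens to reproduce the numbers, with illustrative function names.

```python
from math import log
from statistics import NormalDist


def required_events(hazard_ratio, power=0.90, alpha_two_sided=0.05):
    """Approximate events needed for a log-rank test with 1:1 randomization."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha_two_sided / 2.0)
    z_power = z.inv_cdf(power)
    return 4.0 * (z_alpha + z_power) ** 2 / log(hazard_ratio) ** 2


def events_in_negatives(events_in_positives, y):
    """Events expected in test negatives when event rates are equal across strata."""
    return events_in_positives * (1.0 - y) / y


# 50% reduction in hazard (hazard ratio 0.50): roughly 87.5, rounded up to 88.
print(required_events(0.50))
# With 25% of patients test positive: 88 * 0.75 / 0.25 events in test negatives.
print(events_in_negatives(88, 0.25))  # 264.0
# 33% reduction in hazard (hazard ratio 0.67): roughly 262, close to 264.
print(required_events(0.67))
```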

In the situation where one has more limited confidence in the 
predictive marker it can be effectively used for a "fall-back" analy- 
sis. In Simon and Wang (13), we proposed an analysis plan in 
which the new treatment group is first compared to the control 
group overall. If that difference is not significant at a reduced 
significance level such as 0.03, then the new treatment is com- 
pared to the control group just for test positive patients. The latter 
comparison uses a threshold of significance of 0.02, or whatever 
portion of the traditional 0.05 not used by the initial test. If the 
trial is planned for having 90% power for detecting a uniform 33% 
reduction in overall hazard using a two-sided significance level of 
0.03, then the overall analysis will take place when there are 297 
events. If the test is positive in 25% of patients and the event rates 
in test positive and test negative patients are about equal at the 
time of analysis, then when there are 297 overall events there will 
be approximately 75 events among the test positive patients. If the 
overall test of treatment effect is not significant, then the subset 
test will have power 0.75 for detecting a 50% reduction in hazard 
at a two-sided 0.02 significance level. By delaying the treatment 
evaluation in the test positive patients power 0.80 can be achieved 
when there are 84 events and power 0.90 can be achieved when 
there are 109 events in the test positive subset. Wang et al. have 
shown that the power of this approach can be improved by taking 
into account the correlation between the overall significance test 
and the significance test comparing treatment groups in the sub-
set of test positive patients (14). So if, for example, a significance
threshold of 0.03 has been used for the overall test, the significance
threshold used for the subset can be somewhat greater than 0.02 and
still have the overall chance of a false positive claim of any type limited
to 5%. Real world experience with stratification and enrichment
designs is described by Freidlin et al. (15) and by Mandrekar
and Sargent (16). Freidlin et al. (17) describe a randomized phase
II design for providing information for the design of the phase III 
trial in cases where there is not a strong biological rationale for the 
enrichment approach. 
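
The power figures quoted above for the fall-back subset test can be checked with the same kind of large-sample approximation for the log-rank test. This is a sketch under that assumption (1:1 randomization, normal approximation), not the calculation the authors necessarily used; the function name is illustrative.

```python
from math import log, sqrt
from statistics import NormalDist


def power_at_events(events, hazard_ratio, alpha_two_sided):
    """Approximate power of a two-sided log-rank test given the number of events."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha_two_sided / 2.0)
    # Expected value of the standardized log-rank statistic under the alternative.
    drift = sqrt(events) / 2.0 * abs(log(hazard_ratio))
    return z.cdf(drift - z_alpha)


# Subset test at the 0.02 level for a 50% reduction in hazard.
for n_events in (75, 84, 109):
    print(n_events, round(power_at_events(n_events, 0.50, 0.02), 2))
```

Under this approximation the three event counts give power of about 0.75, 0.80, and 0.90 respectively, in line with the figures in the text.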

INTERIM MONITORING OF TEST NEGATIVE PATIENTS 

Interim monitoring of outcome for the test negative patients is 
very important in clinical trials where there is preliminary evi- 
dence that efficacy of the new regimen may be limited to the test 
positive patients. One approach is to perform an interim analy- 
sis focused on the test negative patients using a standard futility 
monitoring statistical plan for the primary endpoint of the clinical 
trial. Such methods are usually based on either the standardized
treatment effect or the conditional power of rejecting the null 
hypothesis at the end of the trial. One simple approach is to com- 
pute the standardized treatment effect in the test negative patients 
at a time when half of the events in test negative patients projected
to occur by the end of the trial have occurred. If the treatment
effect is going in the wrong direction, then accrual to the test neg- 
ative stratum ceases. This type of futility analysis is designed to 
be conservative enough that the power at the end of the trial for 
detecting a treatment effect is minimally reduced. This type of 
futility monitoring is used in the design proposed by Wang et al. 
(14), but in many cases it provides very limited protection for test
negative patients in biomarker driven designs. Depending
on the accrual rate and survival distributions, by the time half 
of the primary endpoint events have occurred for the test nega- 
tive patients, the accrual of test negative patients may be close to 
complete. 

An alternative approach would be to base the futility moni- 
toring of the test negative patients on an intermediate endpoint 
rather than on the primary endpoint of the trial. There would 
be no assumption that the intermediate endpoint is a true surro- 
gate for the primary endpoint, only that if there is no treatment 
effect on the intermediate endpoint, then there is unlikely to be 
a treatment effect for the primary endpoint. With this limited 
assumption, made for most phase II trials, the futility analysis can 
be performed at an earlier time so that a finding of futility will 
limit the number of test negative patients accrued. 

In Karuri and Simon (18) we introduced a phase III design 
for this setting in which futility monitoring of the test negative 
patients is performed based on a joint prior distribution for
the treatment effects in test negative and test positive patients. 
That prior distribution enables the trial investigator to represent 
the prior evidence that treatment effect will be reduced for test neg- 
ative patients and use that information in monitoring the clinical 
trial. Although the formulation is Bayesian, the rejection region 
based on posterior probability is calibrated so that type I errors 
satisfy the usual frequentist requirements. 

BIOMARKER ADAPTIVE THRESHOLD DESIGN 

In Jiang et al. (19) we reported on a "Biomarker Adaptive Thresh- 
old Design" for situations where a biomarker is available at the 
start of the trial, but a cut-point for converting the value to a binary 
classifier is not established. For example, this design could be used 
with a FISH assay for EGFR positivity without pre-specification of 
the threshold of positivity. Tumor specimens are collected from all 
patients at entry, but the value of the biomarker is not used as an 
eligibility criterion. The analysis plan does not stipulate that the
assay for measuring the index needs to be performed in real time. 
Two analysis plans were described. Analysis plan A begins with 
comparing outcomes for all patients receiving the new treatment 
to those for all control patients. If this difference in outcomes is sig-
nificant at a pre-specified reduced significance level $\alpha_1$ (e.g., 0.03)
then the new treatment is considered effective for the eligible pop-
ulation as a whole. Otherwise, a second stage test is performed
using significance threshold $\alpha_2 = 0.05 - \alpha_1$. The second stage test
involves finding the cut-point s* for the biomarker score which
leads to the largest treatment effect in comparing T to C restricted
to patients with score greater than s*. Jiang et al. employed a
log-likelihood measure of treatment effect and let L* denote the
log-likelihood of treatment effect when restricted to patients with
biomarker level above s*. The null distribution of L* was deter-
mined by repeating the analysis after permuting the treatment
and control labels a thousand or more times, recomputing s* and
L* each time. If the permutation statistical significance of L* is
less than $0.05 - \alpha_1$ (e.g., 0.02), then treatment T is considered superior to
C for the subset of the patients with biomarker level above s*.
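
The second stage of analysis plan A can be sketched in Python as follows. To keep the example self-contained, it swaps in a simpler statistic than the one used by Jiang et al.: a two-sample z statistic on a binary response replaces the log-likelihood measure of treatment effect for a survival endpoint. The function names, the binary outcome, and the grid of candidate cut-points are all assumptions made for illustration.

```python
import random
from math import sqrt


def z_statistic(treat, outcome):
    """Two-sample z statistic for response rates (treated minus control)."""
    n_t = sum(treat)
    n_c = len(treat) - n_t
    if n_t == 0 or n_c == 0:
        return None
    r_t = sum(y for a, y in zip(treat, outcome) if a == 1)
    r_c = sum(outcome) - r_t
    pooled = (r_t + r_c) / (n_t + n_c)
    if pooled in (0.0, 1.0):
        return None  # no variation in outcome, statistic undefined
    se = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    return (r_t / n_t - r_c / n_c) / se


def max_over_cutpoints(treat, score, outcome, cutpoints):
    """Return (L*, s*): the largest statistic and the cut-point achieving it."""
    best_stat, best_cut = float("-inf"), None
    for s in cutpoints:
        keep = [i for i, x in enumerate(score) if x > s]
        stat = z_statistic([treat[i] for i in keep], [outcome[i] for i in keep])
        if stat is not None and stat > best_stat:
            best_stat, best_cut = stat, s
    return best_stat, best_cut


def permutation_p_value(treat, score, outcome, cutpoints, n_perm=1000, seed=0):
    """Permutation significance of L*, re-optimizing the cut-point each time."""
    rng = random.Random(seed)
    observed, best_cut = max_over_cutpoints(treat, score, outcome, cutpoints)
    labels = list(treat)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute treatment and control labels
        stat, _cut = max_over_cutpoints(labels, score, outcome, cutpoints)
        if stat >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1), best_cut
```

Because the cut-point is re-optimized inside every permutation, the null distribution accounts for the selection of s*. The returned p-value would be compared with the second stage threshold (0.02 in the example above) only if the first stage overall test was not significant.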

The advantage of procedure A is its simplicity and that it explic- 
itly separates the test of treatment effect in the broad population 
from the subset selection. However, the procedure takes a con- 
servative approach in adjusting for multiplicity of combining the 
overall and subset tests. An alternative analysis plan B proposed 
by Jiang et al. does not use a first stage comparison of treatment 
groups overall. Consequently, plan B is more appropriate to set- 
tings in which there is greater expectation that treatment effect will 
be limited to a marker defined subset. With analysis plan B they 
determine the cut-point value b at which w(b)S(b) is maximized,
where w(b) is a pre-defined weight function. The weight function 
is used to give greater emphasis to the b = 0 subset, that is, the 
subset containing all patients (marker value is initially normalized 
to the 0-1 interval). Let T(b) = w(b)S(b) denote the value of the
maximized weighted partial log-likelihood. The statistical signifi- 
cance of T(b) is determined by generating the null distribution by 
repeating the optimization procedure for many cases of randomly 
permuted data. With either procedure A or B, a confidence interval 
for the optimal cut-point b is generated by bootstrap re-sampling 
of the maximum likelihood estimate of the cut-point based on 
a proportional hazards model with an unknown cut-point and 
an unknown treatment effect for patients with biomarker values 
above the cut-point. Since the treatment is presumed effective only 
for patients with biomarker above the threshold b, the confidence 
coefficient associated with a given biomarker value x can be inter- 
preted as the probability that a patient with marker value x benefits 
from the new treatment. 

In Jiang et al. (19) we also provided an approach to sample 
size planning for the biomarker adaptive threshold design. With 
analysis strategy A, sample size is determined in the traditional 
manner for overall comparison of the treatment arms but power-
ing the trial for using a reduced significance level $\alpha_1$, e.g., 0.03.
With analysis plan B a larger sample size is used to provide
good power for establishing the statistical significance of treat- 
ment effects restricted to patients with biomarker values above an 
initially unknown cut-point. 

ADAPTIVE ENRICHMENT DESIGNS 

The adaptive threshold design described above (19) enables one 
to conduct the phase III clinical trial without pre-specifying the 
cut-point for the biomarker. It provides for a valid statistical sig- 
nificance test that has good statistical power against alternative 
hypotheses that the treatment effect is limited to patients with 
biomarker values above some unknown level, and it provides a 
confidence interval for estimation of the cut-point. These analy- 
ses are, however, performed at the end of the trial and accrual 
during the trial is not restricted by biomarker value. In Simon 
and Simon (20), we introduced a very general class of adaptive 
enrichment designs in which the eligibility criteria are adaptively 
adjusted during the course of the trial in order to exclude patient 
subsets unlikely to benefit from the new regimen. Others have also 
studied adaptive enrichment designs (21-23). Wang et al. (21) and 
Simon and Simon (20) provide general frameworks for adaptation
and identify statistical significance tests that provide protection of
the study-wise type I error under broad conditions. In Simon and
Simon (20) we applied this framework to the setting of adaptive
threshold enrichment of a single biomarker. 

DESIGNS THAT EVALUATE A SMALL NUMBER OF 
BIOMARKERS 

Because of the complexity of cancer biology, there are many cases 
in which the biology of the target is not sufficiently well under- 
stood at the time that the phase III trials are initiated to restrict 
attention to a single predictive biomarker. The analysis plan used in 
the adaptive threshold design (19) is based on computing a global
test based on a maximum test statistic. For the adaptive threshold 
design, the maximum is taken over the set of cut-points of a bio- 
marker score. The idea of using a global maximum test statistic 
is much more broadly applicable, however. For example, suppose 
multiple candidate binary tests, $B_1, \ldots, B_K$, are available at the start
of the trial. These tests may or may not be correlated with each
other. Let $L_k$ denote the log-likelihood of treatment effect for com-
paring T to C when restricted to patients positive for biomarker k.
Let L* denote the largest of these values and let k* denote the test 
for which the maximum is achieved. As for the adaptive threshold 
design, the null distribution of L* can be determined by repeat- 
ing the analysis after permuting the treatment and control labels a 
thousand or more times. If the permutation statistical significance 
of L* is less than $0.05 - \alpha_1$ (e.g., 0.02), then treatment T is considered
superior to C for the subset of the patients positive for biomarker 
test k*. The stability of the indicated set of patients who benefit 
from T (i.e., k*) can be evaluated by repeating the computation of 
k* for bootstrap samples of patients. This approach can be useful 
when the number of candidate biomarkers is small, as it should 
be by the time a phase III trial is initiated. Some of the adaptive 
enrichment designs (20) can also be employed in that setting with 
multiple biomarker candidates with or without known cut-points 
of positivity. 

ADAPTIVE CLASSIFICATION BASED ON SCREENING 
CANDIDATE BIOMARKERS 

Designs such as the "adaptive signature design" have been devel- 
oped for adaptive multivariate classifier development and internal 
validation based on high dimensional genomic tumor characteri-
zation (24). This design employs a "learn and confirm" structure
in which a portion of the patients are used to select the biomarker
hypothesis, i.e., to develop an "indication classifier" which identi-
fies the target population of patients in which the test treatment
is most likely to be effective, and the remainder of the
patients are used to test the treatment effect in that subset. The adaptive
signature design does not modify eligibility criteria. It is adaptive 
in the sense that the treatment effect is tested in a single subset 
determined based on the clinical trial data but in a manner that 
separates classifier development from testing of treatment effect. 
This is dramatically different from the current practice of ad hoc
analysis in multiple subsets with no control of type I error, or of
using the full dataset both to develop a classifier and to classify
patients for the purpose of hypothesis testing. Since the adaptive sig-
nature design does not use the patients on which the classifier was
developed for the testing of the treatment effect, it avoids the
inflation of type I error described by Wang et al. (25) for other
approaches. Scher et al. described the use of the adaptive signa- 
ture design for planning a pivotal trial in advanced prostate cancer 
(26). The key idea of the adaptive signature approach is to replace 
multiple significance testing based subset analysis with develop- 
ment and internal validation of a single "indication classifier" that 
informs treatment selection for individual patients based on their 
entire vector of covariate values. 

The adaptive signature design approach is very general with 
regard to the methodology applied to the training set for identi- 
fying the single candidate subset in which treatment effect will be 
tested in the validation set. In many cases this can be accomplished 
by developing a model for predicting outcome as a function of 
treatment, selected biomarkers and treatment by biomarker inter- 
actions. In the original adaptive signature design paper this was 
accomplished by screening all the candidate biomarkers using pre- 
dictive models that include the main effect of treatment, main 
effect of a single biomarker, and the corresponding interaction of 
that biomarker with treatment. Candidate markers which exhib- 
ited an interaction nominally significant at a pre-specified level 
were included in a final multivariate predictive model. A machine 
learning weighted voting model was used in the original paper to 
classify patients as either likely to benefit from the new treatment 
or not likely to benefit from the new treatment. The tuning para- 
meters for this classifier were optimized by cross-validation in the 
training set. The multivariate model was then used to classify the 
patients in the validation set, and the treatment effect was eval- 
uated in the subset of the patients in the validation set that were 
classified as likely to benefit from the new treatment based on the 
classifier developed in the training set. 
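
The split-sample logic described above can be illustrated with a deliberately simplified Python sketch. It is not the procedure of the original adaptive signature design paper: binary markers and a binary response are assumed, markers are screened with a crude difference of response-rate differences rather than interaction terms in a fitted model, the vote is unweighted, and no tuning by cross-validation is performed. All function names, the `min_interaction` and `min_votes` thresholds, and the patient record layout are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def rate(values):
    """Proportion of responders; 0.0 for an empty group."""
    return sum(values) / len(values) if values else 0.0


def effect(rows):
    """Response-rate difference, treated minus control."""
    treated = [r["y"] for r in rows if r["treat"] == 1]
    control = [r["y"] for r in rows if r["treat"] == 0]
    return rate(treated) - rate(control)


def select_markers(train, n_markers, min_interaction=0.2):
    """Keep markers whose effect in positives exceeds that in negatives."""
    selected = []
    for k in range(n_markers):
        pos = [r for r in train if r["markers"][k] == 1]
        neg = [r for r in train if r["markers"][k] == 0]
        if effect(pos) - effect(neg) >= min_interaction:
            selected.append(k)
    return selected


def likely_to_benefit(row, selected, min_votes=1):
    """Unweighted vote over the selected markers."""
    return sum(row["markers"][k] for k in selected) >= min_votes


def two_sided_p(rows):
    """Two-sample z test for response rates; returns 1.0 if not computable."""
    treated = [r["y"] for r in rows if r["treat"] == 1]
    control = [r["y"] for r in rows if r["treat"] == 0]
    if not treated or not control:
        return 1.0
    pooled = rate(treated + control)
    if pooled in (0.0, 1.0):
        return 1.0
    se = sqrt(pooled * (1 - pooled) * (1 / len(treated) + 1 / len(control)))
    z = (rate(treated) - rate(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def adaptive_signature(patients, n_markers, seed=0):
    """Split once, develop the classifier on one half, test on the other."""
    rng = random.Random(seed)
    shuffled = list(patients)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, validate = shuffled[:half], shuffled[half:]
    selected = select_markers(train, n_markers)
    subset = [r for r in validate if likely_to_benefit(r, selected)]
    return selected, len(subset), two_sided_p(subset)
```

The point of the structure is that `two_sided_p` never sees the patients used by `select_markers`, so the subset test is a single pre-specified comparison rather than the best of many.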

Many other methods of classifier development can be employed 
using the training set. It is important to recognize, however, that 
one is not developing a prognostic classifier. The classifier is used 
to classify patients as likely to benefit from the new treatment. 
One could develop prognostic classifiers separately for the treat- 
ment and control groups using standard penalized regression 
methods and then classify patients based on which prognostic 
classifier predicts the better outcome. More commonly, however, 
single predictive models have been used based on screening candi- 
date markers based on their univariate interaction with treatment. 
Matsui et al. (27) used their model to predict a continuous score 
reflecting the expected benefit for the new treatment relative to 
the control rather than just classifying patients into one of two 
subsets. Gu et al. (28) have developed a two-step strategy for 
developing a model for predicting outcome as a function of treat- 
ment and selected biomarkers. The biomarkers are selected using 
a group lasso approach in which the main effects of a biomarker 
are grouped with the interactions of that marker with treatments;
the approach can be used with two or more treatments.

Freidlin et al. (29) described further extensions of the adaptive 
signature approach. They use cross-validation to replace sample 
splitting of the trial into a training set and test set in order to 
increase the statistical power. 

CONCLUSION 

Recognition of the molecular heterogeneity of human diseases 
such as cancers of a primary site and the tools for characterizing
this heterogeneity present new opportunities for the develop-
ment of more effective treatments and challenges for the design
and analysis of clinical trials. In oncology, treatment of broad 
populations with regimens that do not benefit most patients 
is less economically sustainable with expensive molecularly tar- 
geted therapeutics and less likely to be successful. The established 
molecular heterogeneity of human diseases requires the devel- 
opment of new approaches to use randomized clinical trials to 
provide a reliable basis for predictive medicine (3, 4). This paper
has reviewed some prospective phase III designs
for the co-development of new therapeutics with companion 
diagnostics.